---
layout: page
title: Script 7 - Regression Diagnostics
permalink: /scripts/script7/
parent: R Scripts
nav_order: 7
---
Run the code in this entire script in your own local R script. Always annotate your code, including the output or result of the code you run, unless it is trivial or already discussed below.
Just like sample means, regression coefficients are estimates. Hence under (hypothetical) repeated sampling and subsequent OLS modelling, we would get a sampling distribution of beta coefficients, illustrating the variation in \(\hat{\beta}\). We want to know whether the observed coefficient derived from our sample is statistically distinguishable from the null hypothesis coefficient (which is \(\beta = 0\)).
First, we specify our null hypothesis. Our research hypothesis \(H_1\) is that \(X\) and \(Y\) are related, meaning \(\beta > 0\) or \(\beta < 0\) (that is, the alternative hypothesis). Hence, the null hypothesis (\(H_0\)) is usually \(\beta = 0\) (for two-tailed tests; for one-tailed tests, it would be that the mean \(\beta\) is larger or smaller than 0).
Then, we choose a statistic giving us a score that can be mapped onto a known distribution telling us the probability of observing this score from our sample if \(H_0\) were true. In linear regression, we calculate the following t-statistic (also called t-value):
\[ T\text{-statistic}_{n-2}=\frac{\hat{\beta}-\beta_{\text{Null}}}{\text{SE of }\hat{\beta}} \]
Then, we set the significance cutoff point \(\alpha\) for rejecting the null hypothesis. \(\alpha\) is usually set at 0.05 (but it can also be 0.10 or 0.01 - these are conventional values). Provided that the sample size is large enough and that we are doing a two-tailed test, we find the critical value for this cutoff point from the t-distribution: 1.96 for \(\alpha = 0.05\).
Lastly, we compare the t-value to the critical value: we reject the null hypothesis if the absolute value of our realised t-value is equal to or larger than 1.96. Mathematically:
\(|\text{realised T-value}| \geq 1.96\)
Let’s implement this with a simple bivariate regression. First load the packages and data we are going to need:
# setwd("~") # set your working directory here if using RStudio Desktop
brexit_data <- read.csv("brexit_data.csv")
# install these packages once (skip if already installed)
install.packages("car")
install.packages("stargazer")
install.packages("lmtest")
install.packages("coefplot")
install.packages("multiwayvcov")
library(multiwayvcov)
library(coefplot)
library(lmtest)
library(car)
library(stargazer)
Now compute the t-statistic and its associated p-value for a regression where age_mean predicts Percent_Leave:
lm1 <- lm(Percent_Leave ~ age_mean, data = brexit_data)
summary(lm1)
# extract coefficients with the coef() function
coef(summary(lm1))
# extract a single value with indexing [row, column]
beta_diff <- coef(summary(lm1))[2,1] - 0
# minus 0 because 0 is the coefficient under the null hypothesis
beta_se <- coef(summary(lm1))[2,2]
# or equivalently, copy-paste the values from the regression output
beta_diff <- 1.287577 - 0
beta_se <- 0.1681582
tstatistic <- beta_diff/beta_se
tstatistic
# compare this value to the regression output
# is the t-stat larger than 1.96?
# Note that pt() accesses values of the t-distribution.
pvalue <- 2*pt(-abs(tstatistic), df = nobs(lm1)-2)
pvalue
# We use pt because we use a t-distribution, and 2* because it's two-tailed.
# The degrees of freedom equals your sample size (nobs(lm1)) minus
# the number of parameters you need to calculate during an analysis
# We use the negative value because we 'start' looking at the cumulative function from the left. If you add the argument 'lower=FALSE', you can use the absolute value.
The p-value summarises our evidence against the null hypothesis. It is the probability of observing a T-value at least as extreme as the one we just calculated, given that the null hypothesis is true. A small p-value indicates that it is very unlikely that the coefficient we estimated (\(\hat{\beta}\)) comes from the sampling distribution of a hypothetical \(\beta\) centered around 0. If the p-value is \(\leq\alpha\), we therefore reject the null hypothesis \(H_0\). The p-value is thus compared to the value of \(\alpha\) for which we reject \(H_0\).
It is common practice to use a t-distribution rather than a normal distribution in hypothesis testing, including in the case of testing the statistical significance of linear regression coefficients. This is because we need to account for sample size and number of regressors when we draw conclusions about the uncertainty of our coefficient estimates. By considering ‘degrees of freedom’, the t-distribution does just that. Note that the difference between the t-distribution and the normal distribution is negligible for reasonably large sample sizes (\(df>30\)). Simply speaking, the t-distribution is more conservative for small sample sizes.
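This convergence is easy to see by comparing critical values from `qt()` and `qnorm()` directly. A minimal sketch (the degrees-of-freedom values are arbitrary examples):

```r
# Two-tailed critical values at alpha = 0.05 for increasing degrees of freedom:
# the t critical value shrinks towards the normal one (about 1.96) as df grows.
alpha <- 0.05
t_crits <- sapply(c(5, 10, 30, 100, 1000),
                  function(df) qt(1 - alpha/2, df = df))
round(t_crits, 3)   # a decreasing sequence, starting around 2.57 at df = 5
qnorm(1 - alpha/2)  # approximately 1.96
```

Note how the gap is already small at df = 30 and practically negligible at df = 100.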
We can also construct confidence intervals around our estimate \(\hat{\beta}\) applying the same logic as before:
\[ CI_{1-\alpha}=\hat{\beta} \pm t_{\alpha / 2} * S E(\hat{\beta}) \]
Typically, for \(n > 50\) and \(\alpha = 0.05\), \(t_{\alpha / 2} = 1.96\). Recall the interpretation: under hypothetical repeated sampling, the true \(\beta\) lies within 95% of all intervals constructed in this way. Hence, we reject the null if the CI doesn’t contain \(\beta\) under the assumption that \(H_0\) is true - which usually would be 0.
We can calculate the confidence intervals manually:
# get the estimated coefficient
betahat <- coef(summary(lm1))[2,1]
conf_interval <- c(betahat-1.96*beta_se, betahat+1.96*beta_se)
conf_interval
#or alternatively, if we want to be super-precise
alpha <- 0.05
critical_value <- qt(1-alpha/2, df = nobs(lm1)-2)
# compute the critical value associated with a significance cutoff point
# of 0.05 (95% confidence level); nobs() gives you the number of observations
# included in lm1, subtract two to get degrees of freedom (because we use
# two variables in a bivariate model: X and Y)
conf_interval <- c(betahat-critical_value*beta_se,
betahat+critical_value*beta_se)
conf_interval
Once you have calculated the confidence intervals, you can plot them to get a better idea on how confident you are about rejecting \(H_0\).
#create a sequence of values of x
newx <- seq(min(brexit_data$age_mean, na.rm = T),
max(brexit_data$age_mean, na.rm = T),by = 0.05)
#use the predict function to get the confidence interval
conf_interval <- predict(lm1, newdata = data.frame(age_mean=newx),
interval="confidence")
# predict() takes all the values of x, you set by age_mean=newx,
# and returns a matrix with the fitted value in column 1,
# the lower bound in column 2, the upper bound in column 3.
plot(brexit_data$age_mean, brexit_data$Percent_Leave)
abline(lm1)
#plot lower bound
lines(newx, conf_interval[,2], col="blue", lty=2)
#plot upper bound
lines(newx, conf_interval[,3], col="blue", lty=2)
You don’t necessarily need to know how to compute these, because they will be calculated automatically in your regression/t-test. But it’s good to know where these numbers are coming from.
# If I have a test statistic (e.g. t-value) and want to compute its p-value,
# use pt() for a t-distribution (or pnorm() for a normal distribution).
# pt() requires you to pass degrees of freedom. You can get those from
# the output of a t-test or a linear regression output.
# remember to pass the t-statistic as a negative number, using -abs(t_stat)
# remember to multiply by 2 if you want a two-tailed p-value
summary(lm1)
t_value <- 0.385
p_value_tdist <- 2*pt(-abs(t_value), df = 342)
p_value_normdist <- 2*pnorm(-abs(t_value))
# If you have a significance cutoff point alpha (0.05, 0.01 etc.) associated with
# a certain confidence level (95%, 99% etc.) and you want to calculate the
# associated critical value, use qt() or qnorm().
alpha <- 0.05
qt(1-alpha/2, df = 342)
qnorm(1-alpha/2)
We can now proceed to the interpretation of multivariate regressions from the Brexit data, assessing the statistical significance of each regression coefficient according to the three methods above: calculating the T-statistic and comparing it to the critical value from the t-distribution for a given level of significance; comparing the chosen level of significance to the p-value; and computing confidence intervals to see whether they include 0 or not. Try it with one coefficient estimate of your choice!
lm2 <- lm(Percent_Leave ~ age_mean + Birth_UK_percent +
Married_percent, data = brexit_data)
summary(lm2)
# interpret the output
# can you reject the null hypothesis for each coefficient?
# at which significance level?
The coefplot() function comes in really handy to plot the confidence intervals for multiple regression coefficients:
coefplot(lm2,
intercept = FALSE,
# ci_levels=c(0.9,0.95), 90%+95% CI levels are the default
title = "Coefficient Plot (Estimates + CIs)",
xlab = "Estimate & CIs",
ylab = "Coefficient",
newNames=c(age_mean="Mean Age", Birth_UK_percent="UK-Born (%)", Married_percent="Married (%)"),
decreasing = TRUE,
col="blue",
sort = "magnitude") # note we exclude the intercept
When a set of key assumptions holds, OLS gives us the most efficient and least biased estimate of the true relationship between the dependent and independent variables. By bias, we mean the expected difference between the estimates calculated by our estimator (the rule we use to estimate the parameters) and the population parameter (e.g., the true relationship between \(X\) and \(Y\)). The least biased estimator minimises this difference. The estimator whose coefficient estimates have the least variance is the most efficient (this makes sense intuitively, as variance is a measure of how far the coefficient estimates potentially are from the true \(\beta\) in the population). A lower variance essentially indicates that under repeated sampling the estimates are likely to be similar.
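Both ideas can be illustrated with a small simulation. The sketch below uses made-up data with a true coefficient of 2, so the OLS assumptions hold by construction: the estimates average out to the true value (unbiasedness), and their spread is the sampling variance (efficiency).

```r
# Simulation sketch: under hypothetical repeated sampling, OLS estimates
# of beta centre on the true value when the assumptions hold.
set.seed(123)
true_beta <- 2
betahats <- replicate(1000, {
  x <- rnorm(100)
  y <- 1 + true_beta * x + rnorm(100)  # errors independent of x, mean zero
  coef(lm(y ~ x))[2]
})
mean(betahats)  # close to 2: the estimator is unbiased here
sd(betahats)    # the spread of the sampling distribution (efficiency)
hist(betahats, main = "Sampling distribution of beta-hat")
```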
This script provides you with a brief explanation and exploration of the required assumptions, and guidance on how to check whether they hold using R.
The key OLS assumptions for unbiasedness and efficiency are: 1) random sampling, 2) no perfect collinearity, 3) zero conditional mean, 4) homoskedasticity, and 5) normality of the error term.
Checking that these assumptions hold for our models is crucial. If they do not, we may get biased or inefficient estimates of the coefficients and/or standard errors. When we find that these assumptions are violated, however, we can make adjustments to our model specification to help us make (more) valid inferences — some of which we will address below.
Most regression diagnostics entail testing various properties of the residuals of our predictions. Together, these tests indicate whether the above assumptions hold - hence whether OLS is BLUE, i.e. the best (i.e. most efficient) linear unbiased estimator.
The assumption of no perfect collinearity requires that there are no exact linear relationships between the independent variables. An exact linear relationship means you can write an explanatory variable as a linear combination of the other explanatory variables (e.g. one variable is a multiple of another, or the sum of some of the others). If two variables are perfectly correlated, only one of their coefficients can be estimated.
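A quick illustration with simulated data (the variable names are made up): when one predictor is an exact linear function of another, lm() cannot estimate both and returns NA for the redundant coefficient.

```r
# age_months is an exact multiple of age_years: perfect collinearity
set.seed(1)
df <- data.frame(age_years = rnorm(50, mean = 40, sd = 5))
df$age_months <- df$age_years * 12          # exact linear combination
df$y <- 2 * df$age_years + rnorm(50)
m <- lm(y ~ age_years + age_months, data = df)
coef(m)  # the age_months coefficient is NA: only one of the two is estimable
```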
More commonly, you may have high (but not perfect) correlation between two or more independent variables. This is called multicollinearity. While this does not constitute a violation of any of the assumptions needed for OLS to be the best linear unbiased estimator, it will inflate the standard errors, and is therefore something to look out for. Moreover, in the presence of multicollinearity, coefficients can swing wildly depending on which predictors are in the model, making the model very sensitive to small changes in the specification. All other things equal, to estimate \(\hat{\beta}_j\), it is preferable to have less correlation between \(X_j\) and the other independent variables.
Take this ‘silly’ example. Let’s add both the mean and median age of the local authority as independent predictors of the Leave vote.
lm3 <- lm(Percent_Leave ~ age_mean + age_median, data = brexit_data)
summary(lm3)
#compare with age_mean only model (lm1)
stargazer(lm1, lm3, type = "text")
What happens to the standard error of mean age? How much does the \(R^2\) change? Do you think it’s worth having much more imprecise estimators for such a change in goodness-of-fit? Does the substantive interpretation of the coefficients make any intuitive sense to you?
This does not just happen with variables that measure essentially the same thing, like mean and median age. Even variables that ostensibly capture different characteristics of our observations can be so tightly related to each other that using both in the same model becomes redundant. Let’s illustrate this by running some further multivariate regressions with multicollinear variables - variables that are highly (but not perfectly) collinear, i.e. that move together almost, but not exactly, linearly. Which of the variables in our dataset may be collinear? For example, mean earnings in an area are likely to be highly correlated with the area’s occupational structure, and in particular the percentage of individuals working in higher managerial, administrative and professional positions. Let’s check what happens when we include both in a regression:
lm4 <- lm(Percent_Leave ~ Health_very_good_percent +
Bachelors_deg_percent + Occup_high_manage_admin_profess_percent,
data = brexit_data)
summary(lm4)
lm5 <- lm(Percent_Leave ~ Health_very_good_percent +
Bachelors_deg_percent + Occup_high_manage_admin_profess_percent + Earnings_Mean,
data = brexit_data)
summary(lm5)
# compare the two models using stargazer
stargazer(lm4, lm5, type = "text")
# what happens to the standard errors? What happened to R-squared?
lm6 <- lm(Percent_Leave ~ age_mean + Married_percent, data = brexit_data)
#compare with original age-only model (lm1)
stargazer(lm1, lm6, type = "text")
#same questions. what do you conclude about age and marriage rates?
The coefficients are less precise (look at the standard errors) than those found in previous models using the Brexit data. Again, this does not constitute a violation of any of the assumptions (OLS is still unbiased), but the coefficients are less precisely estimated, making statistical inference more difficult.
If you suspect there is multicollinearity between some of your independent variables, as a first step you can look at the correlation between your independent variables. Alternatively, you can regress the suspect (let’s call it \(X_j\) for now) on the other independent variables, and look at the R-squared from that regression - which gives you the proportion of the total variation in \(X_j\) that can be explained by the other independent variables. A higher R-squared in the regression which takes as outcome variable the suspect independent variable reflects multicollinearity with other independent variables in your model.
It is also possible to test directly for multicollinearity by measuring the Variance Inflation Factor (VIF) for each independent variable in a model (see Agresti and Finlay, section 14.3, for details). A VIF of 1 indicates that there is no correlation between this independent variable and the other predictors in the model. VIFs between 1 and 5 suggest that there is moderate, but acceptable correlation: no further action is needed. VIFs greater than 5 are problematic: coefficients are very imprecise and sensitive to minor changes in the model specification, and the p-values are questionable, rendering hypothesis testing non-robust.
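The link between the auxiliary-regression R-squared described above and the VIF is direct: \(VIF_j = 1/(1-R^2_j)\). A check by hand, on simulated data so it runs without the Brexit dataset:

```r
# Two correlated predictors (x1, x2) and one independent one (x3)
set.seed(42)
n  <- 200
x2 <- rnorm(n)
x1 <- 0.8 * x2 + rnorm(n, sd = 0.5)  # x1 strongly correlated with x2
x3 <- rnorm(n)
# R-squared from regressing the 'suspect' x1 on the other predictors
r2_aux <- summary(lm(x1 ~ x2 + x3))$r.squared
manual_vif <- 1 / (1 - r2_aux)
manual_vif  # vif() from the "car" package returns the same value for x1
```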
# create a correlation matrix between the health, occupation and education variables in lm4
# first create a dataframe with only the relevant variables, and then pass this dataframe as
# argument of the cor() function, setting use = "complete"
cor(data.frame(health = brexit_data$Health_very_good_percent,
               occupation = brexit_data$Occup_high_manage_admin_profess_percent,
               education = brexit_data$Bachelors_deg_percent),
    use = "complete")
#Alternatively, compute the Variance Inflation Factor with the vif() function from the "car" package.
vif(lm4)
# Note that education (bachelor's degrees) and occupation are likely the problematic pair:
# VIFs are above 5 for all independent variables, especially education,
# suggesting severe multicollinearity. And obviously, the problem
# is even worse for our redundant lm3 (age_mean + age_median)
vif(lm3)
What can be done about this? If multicollinearity is an artifact of small sample size, a solution would be to increase sample size - but that is usually not feasible. You can also drop one of the independent variables - but be careful: if you drop a variable that is a predictor of your outcome in the population, this can bias the remaining estimates! Remember that the inclusion of explanatory variables has to be motivated by theory and on the basis of their relationship with a) the outcome, and b) other explanatory variables included in the model. Most commonly, when the collinear variables are ‘controls’ – i.e. predictors that you need to add to account for omitted variable bias, but are not your main substantive interest – you should present alternative model specifications with each of the collinear variable in isolation as ‘robustness checks’.
This is the most crucial assumption: if it does not hold, our OLS estimates will be biased. The assumption requires that the error term \(\epsilon\) shows no systematic pattern, but has an expected value of zero given any value of the explanatory variable \(X_i\). Formally:
\[ E(\epsilon | X_i) = 0\] This means that the expected value of the error term conditional on the independent variable \(X_i\) is zero. In other words, other factors not included in the model that affect our outcome \(Y\) are not related on average (i.e. in expectation, or systematically) to any of the independent variables \(X_i\) included in your model. Put slightly differently, there should be no discernible correlation between the unexplained part of your outcome (i.e. the error term) and the independent variables used to explain the outcome. Omitting an important variable that is related to any of your explanatory variables \(X_i\) and to the outcome will violate this assumption.
Think of how we identified confounders/omitted variables before: we plotted education against our residuals, and saw a relationship. That meant that this variable should be included in our model (once included, it will no longer correlate with the residuals).
Let’s visualize this by plotting the residuals against each of the independent variables in two of our models (lm4 and lm6), as well as against the fitted (predicted) values of the dependent variable. We also plot the residuals against the fitted values because if the zero conditional mean assumption is violated for one independent variable, it is likely that the predicted values of the dependent variable will also display a non-zero conditional mean.
summary(lm4)
summary(lm6)
#use the residualPlot() function from the car package
residualPlot(lm4, variable = "Health_very_good_percent")
residualPlot(lm4, variable = "Bachelors_deg_percent")
residualPlot(lm4, variable = "Occup_high_manage_admin_profess_percent")
residualPlot(lm4, variable = "fitted")
# the distribution of residuals is roughly (and bar a single outlier) symmetric: this is good!
residualPlot(lm6, variable = "age_mean")
residualPlot(lm6, variable = "Married_percent")
residualPlot(lm6, variable = "fitted")
# residuals are above zero for intermediate values of the
# independent variables, particularly Married_percent
# and the predicted values of Percent_Leave. This indicates that the zero
# conditional mean assumption is violated.
We can see that in our first model the lines are (roughly) straight: there is no correlation between the residuals and each of the independent variables, or between our residuals and the fitted values of the dependent variable. In the second model, with age and marriage rates, we get an inverted-U shape for the relationship between the residuals and our fitted values. The zero conditional mean assumption seems to be particularly violated for the Married_percent variable. The reason for this might lie in the non-linearity of the relationship between our variables.
Note that this approach does not provide a definitive test of the zero conditional mean assumption as the true error term remains unknown. In fact, we cannot credibly test this assumption with data alone - the assessment of this assumption comes down to theory, reasoning, and qualitative aspects of research design (most prominently, focusing on assignment rules and causal inference).
When the zero conditional mean assumption holds, we say that we have exogenous explanatory variables. If, on the other hand, an independent variable \(X_i\) is correlated with the error term for any reason, then \(X_i\) is said to be an endogenous explanatory variable. Note that there are other ways that the error term can be correlated with an explanatory variable - for instance if you have measurement error in an explanatory variable, or when one or more explanatory variables are jointly determined with your outcome. For now however, we leave these issues aside.
Note that our residuals are not constant across values of \(X_i\), indicating that the variance of the error term is not constant. This leads us to our next stage of regression diagnostics: homoskedasticity.
Can you figure out why our model, with only a linear specification of marriage rates, overestimates Leave support in areas with very high and very low shares of married population?
Marriage rates are likely to be non-linearly correlated with Leave support, perhaps because on the left side of the distribution (low values of Married_percent) they capture the share of young people, who vote Remain, while on the right side of the distribution (high values of Married_percent) they capture prosperity and households’ economic security, which is commonly related to large Remain vote shares. If you know your British geography, you’ll recognise these two types of places from the same residuals plot, where instead of dots we plot local authority names: urban municipalities full of young professionals on the left, and leafy suburbs in wealthy parts of the South on the right, both places with low Leave vote shares. We’ll be looking into such circumstances in a bit.
plot(brexit_data$Married_percent[!is.na(brexit_data$age_mean)],
residuals(lm6), type = "n")
text(x = brexit_data$Married_percent[!is.na(brexit_data$age_mean)],
y = residuals(lm6),
labels = brexit_data$Area[!is.na(brexit_data$age_mean)], cex = 0.5)
Homoskedasticity means that the error term has the same variance given any value of the explanatory variable. In other words, the variance of the error term, conditional on the explanatory variables, is constant for all combinations of their values. Homoskedasticity is also referred to as “constant variance”. Formally:
\[ Var(\epsilon | X_i) = \sigma^2\]
If this assumption is violated - that is, if the variance of the error term differs across values of the explanatory variables - we are dealing with heteroskedasticity (from Greek: homo = same, hetero = different, skedasis = dispersion). Homoskedasticity is not necessary to get unbiased coefficient estimates - so long as the zero conditional mean assumption is met! It is still important because in case of heteroskedasticity - i.e. if the error term varies at different levels of our explanatory variables - we will get biased standard errors for our coefficient estimates! The usual formula for the variance of \(\hat{\beta}_i\) requires constant variance of the error term conditional on \(X_i\); heteroskedasticity invalidates that formula for the standard errors.
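To build intuition, here is a simulation where the error spread grows with \(X\) by construction (made-up data): the residual variance visibly differs across the range of the predictor.

```r
# Heteroskedastic data: the error standard deviation increases with x
set.seed(7)
x <- runif(500, 1, 10)
y <- 2 + 3 * x + rnorm(500, sd = x)  # error sd equals x: non-constant variance
m <- lm(y ~ x)
res <- residuals(m)
sd(res[x < 5.5])      # smaller spread where x is small...
sd(res[x >= 5.5])     # ...larger spread where x is large
plot(fitted(m), res)  # the classic 'fan' shape of heteroskedasticity
```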
We can check whether this assumption holds by plotting fitted values against standardised residuals. See the next section for more details on standardised residuals; in short, we divide residuals by their standard deviation to account for the fact that the residuals’ variance decreases as the corresponding realisation of \(X_i\) gets farther away from the average x-value (\(\overline{X_i}\)). This is because OLS is ‘better’ at fitting values at the ends of the range of the explanatory variables.
# plot standardized residuals vs. fitted values
spreadLevelPlot(lm4)
spreadLevelPlot(lm6)
The residuals are unevenly scattered along the x-axis in both cases, which is the visual translation of heteroskedasticity in our models. Alongside ‘eyeballing’ heteroskedasticity, you can also test for it using the Breusch-Pagan test (bptest() in the lmtest package): the null hypothesis in this case is that the residuals are distributed with equal variance (homoskedasticity), so a low p-value indicates that heteroskedasticity is present in the model.
# Conduct Breusch-Pagan Test
bptest(lm4)
bptest(lm6)
What can you do if you encounter heteroskedasticity? Heteroskedasticity can be corrected for by adjusting how we calculate our standard errors. In particular, we can calculate “robust standard errors” that correct for the uneven variance of the residuals. Robust standard errors essentially weigh the residuals according to their relative importance to correct for imbalances in their dispersion. It is worth noting that robust standard errors typically (but not always) give us more conservative (i.e. larger) estimates of a coefficient’s standard error. However, keep in mind that robust standard errors do not fix problems stemming from model specification, and should not be used just to “be safe”: they should be used if there’s a good reason to use them, otherwise they will inflate the standard errors for no reason - and our model will be less efficient than it could be.
To get model fits with heteroskedasticity-corrected standard errors, you proceed in two steps: (1) use the hccm() function (from car) to get a heteroskedasticity-robust coefficient covariance matrix, and (2) pass the original model and the covariance matrix as arguments of coeftest() to get the full model with heteroskedasticity-robust standard errors.
robust_se <- hccm(lm4, type="hc1") #from the car package
lm4_robust_se <- coeftest(lm4, robust_se) #from the lmtest package
# compare to model with nonrobust standard errors
summary(lm4)
stargazer(lm4, lm4_robust_se, type = "text")
# note: hccm() actually defaults to type="hc3"; we pass type="hc1"
# explicitly because these are the most commonly reported robust
# standard errors (e.g. the default in Stata)
Usually, the OLS estimator requires our observations to be independent from one another. However, there are a number of situations in which this assumption can easily be violated, for instance when your observations are grouped within the units they were collected from. Typically, student performance is classroom-specific (teachers and peers have an effect, and affect the whole class similarly); so are municipality- or country-level observations. When residuals vary by a group-level variable, we use cluster-robust standard errors that take this clustering of observations into account, i.e. allow us to relax the independence assumption. Note that if we have multiple levels of data, we typically cluster by the largest unit.
clustered_se <- cluster.vcov(lm4, brexit_data$Region) #from multiwayvcov package
# we cluster by region code
lm4_clustered_se <- coeftest(lm4, clustered_se)
# how did the standard errors change?
stargazer(lm4, lm4_clustered_se, type = "text")
Ultimately, the type of standard error you use depends on the structure of your data: the relationship among model residuals (clustering), and the relationship between the variance of the residuals and the independent variables (homoskedasticity or heteroskedasticity).
This assumption means that the error term (\(\epsilon\)) is independent of the explanatory variables and is normally distributed with a mean of zero. Non-normality is the “least of our problems” in regression diagnostics because its violation does not entail bias or inefficiency of our OLS estimates. It matters for the calculation of p-values, but this is only a concern when the sample size is small. If it is large enough, the Central Limit Theorem entails that we can reasonably assume a normal distribution of the error terms.
Note that errors and residuals are not the same! The error term is an unobserved quantity based on the true population model - whereas residuals are the observed difference between the predicted or fitted value \(\hat{Y_i}\) and the true observation \(Y_i\). However, both the difference between the expected values and the observed values (error) and the difference between fitted values and the observed values (residuals) equal zero, on average, if the above-mentioned OLS assumptions hold.
The next test one should run is therefore to ensure that the residuals are roughly normally distributed. We already know how to produce simple residuals, but now we need the variance of the residuals to be standardized, so we will generate what are called “studentised” or “standardised” residuals. The standardised residual is the residual divided by its standard deviation. The reason why we standardise residuals is that the variance of the residuals at different values of the explanatory variables may differ, even if the variances of the errors at these different values are equal. Indeed, because of the way OLS works, the distribution of residuals at different data points (of the explanatory variable) may vary even if the errors themselves are normally distributed (intuitively, the variability of residuals will be higher in the middle of the domain than at the end). Hence, to compare residuals at different values of the explanatory variables (or even across regression models), we need to adjust the residuals by their expected variability i.e. standardize them.
In R, we can compute the standardised residuals with the rstandard() function.
# standardized residuals
lm4_stdres <- rstandard(lm4)
lm6_stdres <- rstandard(lm6)
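Under the hood, rstandard() divides each residual by its estimated standard deviation, \(\hat{\sigma}\sqrt{1-h_i}\), where \(h_i\) is the observation's leverage (hat value). A quick check on simulated data:

```r
# Manual standardised residuals match rstandard()
set.seed(99)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
m <- lm(y ~ x)
sigma_hat  <- summary(m)$sigma            # residual standard error
manual_std <- residuals(m) / (sigma_hat * sqrt(1 - hatvalues(m)))
all.equal(manual_std, rstandard(m))       # TRUE
```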
Taking into account the Central Limit Theorem, we can treat the distribution of standardised residuals as approximately standard normal. In other words, about 5% of values should be above 1.96 or below -1.96.
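That rule of thumb is easy to check. The sketch below uses simulated data so it runs on its own; the same one-liner works on lm4_stdres or lm6_stdres:

```r
# Share of standardised residuals beyond +/- 1.96: roughly 5% under normality
set.seed(2024)
x <- rnorm(1000)
y <- 1 + 2 * x + rnorm(1000)
share_extreme <- mean(abs(rstandard(lm(y ~ x))) > 1.96)
share_extreme
```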
A good way of looking at the distribution of residuals is to produce a histogram or a density plot of the standardised residuals.
summary(lm4_stdres) # range from -3.9 to 3.2
hist(lm4_stdres,freq=FALSE, breaks=seq(-5,4,0.25))
lines(density(lm4_stdres), lty="dotted", col="green", lwd=2)
lm4xfit<-seq(min(lm4_stdres), max(lm4_stdres), by=0.1)
lm4yfit<-dnorm(lm4xfit, mean=0, sd=1)
lines(lm4xfit, lm4yfit, col = "red", lwd = 2)
# Density Plot
plot(density(lm4_stdres))
While not perfectly normally distributed, our standardised residuals are more or less normally distributed, indicating that the assumption may hold. Again, keep in mind that this test, like the others you just learned about, is a diagnostic, not definitive proof that an assumption holds - we could only provide such proof if we had information about the underlying population.
| Assumption Violated | Issue | Diagnostic Test | Fix |
|---|---|---|---|
| No Multicollinearity | Inflates SE | Regress the IV on other IVs, compute the VIF | Increase sample size, remove suspect |
| Zero Conditional Mean | Biases estimates | Plot residuals against fitted values | Include confounders or polynomial terms |
| Heteroskedasticity | Biases SE | Plot standardised residuals against fitted values, run B-P test | Compute robust SE |
| Normality of error term | Wrong p-values (for small samples) | Plot distribution of standardised residuals | None |